ANALYSIS OF LOAN DATA by ANTONIO SANCHEZ HEVIA

Introduction

The aim of this document is to provide insight into a data set of loan data provided by Prosper. The main goal of the analysis is to find correlations between the interest rate and other variables in the data set, such as income range or occupation. This document also contains an analysis of different variables related to the top and bottom 10 occupations by borrower rate. The purpose of that analysis is to understand why these individuals get lower borrower rates. Correlations between the borrower rate and variables such as annual income or home ownership have been analysed.

The conclusions of this analysis are detailed in the last section.

Out of all the existing variables in the data set, only the following are subject to analysis:

1. BorrowerRate
2. LoanOriginationDate
3. Occupation
4. CreditGrade
5. ProsperScore
6. ProsperRating..Alpha.
7. BorrowerState
8. IsBorrowerHomeowner
9. IncomeRange
10. EmploymentStatus
11. StatedMonthlyIncome
12. ListingCategory..numeric.
13. LoanOriginalAmount
14. LoanOriginationDate

Univariate Plots Section

Year of the loan

The loan origination year is analysed:

The number of loans increased substantially from 2005 to 2008, decreased in 2009 and recovered by 2011. From 2011 to 2013 there is a dramatic increase in the number of loans originated, hitting a maximum in 2013.
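
A minimal base R sketch of how the origination year could be derived and counted. It assumes LoanOriginationDate is stored as text beginning with a YYYY-MM-DD date; the four rows are made-up values, not taken from the data set.

```r
# Toy data: these dates are invented for illustration only.
loans <- data.frame(
  LoanOriginationDate = c("2007-08-26 00:00:00", "2008-01-15 00:00:00",
                          "2013-03-03 00:00:00", "2013-11-20 00:00:00"),
  stringsAsFactors = FALSE
)

# Keep the date part of the timestamp and extract the year as an integer.
# Assumption: the text starts with a YYYY-MM-DD date.
loans$LoanOriginationYear <- as.integer(
  format(as.Date(substr(loans$LoanOriginationDate, 1, 10)), "%Y")
)

# Number of loans originated per year.
loans_per_year <- table(loans$LoanOriginationYear)
print(loans_per_year)
```

The resulting counts are what a bar chart of loans per year would display.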

The Listing Category

As shown in the graph above, Debt Consolidation is the predominant category by a significant margin. The next most common categories are Not Available and Other, followed by Home Improvement and Business.

Occupation

The categories Other and Professional are the most common occupations. Neither is very descriptive. The remaining occupations range from 1000 to 3000 counts.

Borrower Rate

Below there is a scatter plot describing the distribution of borrower rates across the data.

The majority of rates are within a range of 0.05 to 0.35, with around 0.15 being the most common rate.

Credit Grade

Below there is a histogram displaying the credit grades across the data set.

As shown in the graph above, the distribution is roughly bell-shaped, with C being the most common grade.

ProsperRating..Alpha.

The ProsperRating..Alpha. column contains the ratings assigned by Prosper after July 2009.

The information contained on both columns, Credit Grade and ProsperRating..Alpha. is complementary:

##       ProsperRating..Alpha. CreditGrade
## 49990                                 C
## 49991                     B            
## 49992                     A            
## 49993                     C            
## 49994                                NC
## 49995                     B            
## 49996                                AA
## 49997                                 E
## 49998                                HR
## 49999                     B

Thus, an extra column has been created containing the information of the two columns combined.
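
One way such a combined column could be built is sketched below. It assumes that missing values in either column are stored as empty strings, as the listing above suggests. CombinedGrade is a hypothetical column name, the rows are illustrative, and the best-to-worst ordering of the grades is an assumption.

```r
# Toy rows mirroring the shape of the listing above; values are illustrative.
loans <- data.frame(
  ProsperRating..Alpha. = c("", "B", "A", "", "C"),
  CreditGrade           = c("C", "", "", "HR", ""),
  stringsAsFactors = FALSE
)

# Use ProsperRating..Alpha. where present, otherwise fall back to CreditGrade.
# CombinedGrade is a hypothetical name for the extra column.
loans$CombinedGrade <- ifelse(loans$ProsperRating..Alpha. != "",
                              loans$ProsperRating..Alpha.,
                              loans$CreditGrade)

# Assumed ordering from best to worst, so tables and plots sort sensibly.
grade_levels <- c("AA", "A", "B", "C", "D", "E", "HR", "NC")
loans$CombinedGrade <- factor(loans$CombinedGrade,
                              levels = grade_levels, ordered = TRUE)

print(table(loans$CombinedGrade))
```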

Again, the histogram above is roughly bell-shaped, with C being the most common rating.

Is the borrower a home owner?

After analyzing the variable IsBorrowerHomeowner the results are:

## 
## False  True 
## 56459 57478

There is almost a 50-50 distribution across the data set.

Income Range

Below there is a histogram displaying the income ranges across the data set.

The most predominant income range is $25,000 to $49,999, followed by the range $75,000 to $99,999. A surprising 15% of the individuals in the data earn more than $100,000.

Employment Status

As the graph above shows, the majority of the individuals in the data set are employed. Individuals who are not working (Not Employed and Retired) represent a minority.

Loan Original Amount

Given that there are some outliers, limits have been set on the x axis.

Three quarters of the loans have an original amount of $12,000 or less, with around $4,000 being the most common amount.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

Univariate Analysis

Dataset structure

Each instance in the data set has extensive details about a loan, its conditions and the profile of the borrower. There are 81 variables available for each instance, of which only 14 are analysed. Variables can be grouped into loan variables (interest rate, state, origination date, original amount…) and person variables (income range, occupation, home ownership…).

Aim

The main aim of the analysis is to find correlations between the borrower rate and the other variables. What is the reason behind some individuals getting lower borrower rates than others?

The more financially secure an individual is, the more likely that individual is to get a lower borrowing rate. Owning a house or having a high salary adds to financial security. Past behavior, such as having delinquencies on the record, is also likely to affect the borrowing rate. Such information can be found in the following variables: IncomeRange, Occupation, DelinquenciesLast7Years or CreditGrade.

Modifications to the data set

The information available about credit grades was split into two columns. An extra column has been created containing the combination of both columns.

Bivariate Plots Section

Credit Grade vs Median Borrower Rate

The median borrower rate is computed for each credit grade and the corresponding bar chart is created.

It is no surprise that the credit grades correlate linearly with the median borrower rate. The better the rating, the lower the borrower rate.
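
The per-grade medians behind a chart of this kind can be computed with base R's aggregate(). The sketch below uses invented rates and a hypothetical CombinedGrade column holding the merged credit grade.

```r
# Toy data: the rates are invented; CombinedGrade is a hypothetical column
# holding the merged credit grade.
loans <- data.frame(
  CombinedGrade = c("A", "A", "A", "C", "C", "C", "HR", "HR"),
  BorrowerRate  = c(0.08, 0.10, 0.12, 0.18, 0.20, 0.19, 0.30, 0.32),
  stringsAsFactors = FALSE
)

# One row per grade with the median borrower rate of that grade.
median_by_grade <- aggregate(BorrowerRate ~ CombinedGrade,
                             data = loans, FUN = median)
print(median_by_grade)
```

Passing median_by_grade$BorrowerRate to barplot() with names.arg = median_by_grade$CombinedGrade would draw the bar chart. Replacing CombinedGrade with IsBorrowerHomeowner in the formula gives the equivalent comparison by home ownership.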

Borrower Rate across years

Comparing the Loan Origination Year histogram from the previous section with the plot above, it seems that the mean borrower rate is related to the number of loans issued each year.

Owning a Home vs Borrow Rate

## 
##  0.17 0.198 
##     1     1

The median borrower rate is almost 3 percentage points lower for individuals who own a home (0.170) than for individuals who do not (0.198).

Occupation vs Borrow Rate

The graph below represents the median aggregated borrower rate by Occupation.

It seems that higher education jobs correlate with a lower borrower rate. Judge, Doctor and Pharmacist are the occupations with the lowest aggregated median borrower rate, whereas Teacher's Aide, Nurse's Aide and Student - College Freshman get the highest.

Bivariate Analysis

In this section the correlation between Credit Grade and Borrower Rate has been analysed. It is no surprise that better credit grades correlate with lower borrower rates. Owning a house also improves the borrower rate.

The variation of aggregated borrower rates throughout the years available in the data set has also been analysed. The mean borrower rates decreased from 2006 to 2008, hitting a minimum in that year. After 2008, when the financial crisis broke out, the rates increased steadily for 3 years, peaking in 2011.

In the last part of this section, the relationship between Borrower Rate and Occupation was analysed. Taking the aggregated median borrower rate per Occupation, it was found that higher education jobs correlate with lower borrowing rates, and that lower education jobs, such as Teacher’s Aide, correlate with higher borrowing rates.

The next section explores that issue in more depth. The top and bottom 10 Occupations by borrowing rate are analysed in detail.

Multivariate Plots Section

As mentioned in the previous section, the top and bottom 10 Occupations by median borrower rate are analysed in further depth.

For each Occupation set the following variables are analysed:

* Credit Grade
* Income Range
* Is home owner?
* Delinquencies in last 7 years
* Loan Original Amount
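
A possible way to form the two occupation sets is sketched below in base R: rank occupations by median borrower rate, then label the loans belonging to the lowest-rate and highest-rate occupations. The rates are invented, OccupationGroup is a hypothetical column name, and the toy example keeps 2 occupations per group instead of 10.

```r
# Toy data: occupation names exist in the data set, the rates are invented.
loans <- data.frame(
  Occupation   = c("Judge", "Judge", "Doctor", "Doctor", "Clerical", "Clerical",
                   "Laborer", "Laborer", "Food Service", "Food Service"),
  BorrowerRate = c(0.10, 0.12, 0.13, 0.15, 0.20, 0.22, 0.23, 0.25, 0.26, 0.28),
  stringsAsFactors = FALSE
)

n_groups <- 2  # the analysis uses 10; 2 keeps the toy example small

# Median borrower rate per occupation, sorted from lowest to highest.
median_rate <- tapply(loans$BorrowerRate, loans$Occupation, median)
ranked      <- names(sort(median_rate))

top_occ    <- head(ranked, n_groups)  # lowest median rates
bottom_occ <- tail(ranked, n_groups)  # highest median rates

# Label each loan; occupations in neither set become NA and are dropped.
loans$OccupationGroup <- ifelse(loans$Occupation %in% top_occ, "Top",
                         ifelse(loans$Occupation %in% bottom_occ, "Bottom", NA))
ranked_subset <- loans[!is.na(loans$OccupationGroup), ]

print(table(ranked_subset$OccupationGroup))
```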

Credit Grade

Given that the Occupations were classified by borrowing rate, and it has been shown that borrowing rate correlates with Credit Grade, it is no surprise that the top 10 Occupations get better credit grades. The mode Credit Grade for the top 10 occupations is A, whereas for the bottom 10 occupations it is two grades lower, a C.

Income Range

Individuals in the top 10 occupations group earn significantly more than individuals in the bottom 10 group. The mode for the top 10 occupations group is $100,000+, whereas for the bottom 10 occupations group it is $25,000-49,999.

Is home owner?

Again there is a significant difference between the two groups. The proportion of individuals in the top 10 occupations group that own a home is 20 percentage points higher.

Delinquencies in last 7 years

Individuals in the bottom 10 Occupations group are more likely to have delinquencies in the past 7 years. There is a 12.6 percentage point difference between the two groups.

Top 10 Occupations delinquencies summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   2.932   0.000  99.000       4

Bottom 10 Occupations delinquencies summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.00    4.66    4.00   99.00      29

Out of the individuals with delinquencies on their record, those in the bottom 10 group have a higher number of delinquencies on average. The mean number of delinquencies is around 50% higher.
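
The two quantities discussed here, the share of individuals with at least one delinquency and the mean count among those who have any, could be computed per group as in the following sketch. The counts are invented and OccupationGroup is a hypothetical column name.

```r
# Toy data: delinquency counts are invented; OccupationGroup is hypothetical.
grp <- data.frame(
  OccupationGroup         = c("Top", "Top", "Top", "Top",
                              "Bottom", "Bottom", "Bottom", "Bottom"),
  DelinquenciesLast7Years = c(0, 0, 0, 2, 0, 5, 1, NA),
  stringsAsFactors = FALSE
)

# TRUE when the borrower has at least one delinquency; NA stays NA.
has_delinq <- grp$DelinquenciesLast7Years > 0

# Share of borrowers with any delinquency, ignoring missing values.
share_with_delinq <- tapply(has_delinq, grp$OccupationGroup, mean, na.rm = TRUE)

# Mean number of delinquencies among borrowers who have at least one.
with_delinq <- grp[which(has_delinq), ]
mean_if_any <- tapply(with_delinq$DelinquenciesLast7Years,
                      with_delinq$OccupationGroup, mean)

print(share_with_delinq)
print(mean_if_any)
```

Using which() when subsetting drops rows where the count is missing, so NA values do not leak into the conditional mean.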

Loan Original Amount

Top 10 Occupations Loan Original Amount summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    8500   10121   15000   35000

Bottom 10 Occupations Loan Original Amount summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    4000    6247    9500   35000

Loan Original Amount Proportions

Although the top 10 Occupations group is getting lower borrower rates, the median loan original amount shows that they are getting loans around twice as large as those of the bottom 10 occupations group.
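
As a quick arithmetic check, the ratio between the two groups can be computed directly from the median and mean values printed in the summaries above.

```r
# Median and mean loan original amounts copied from the two summaries above.
top    <- c(median = 8500, mean = 10121)
bottom <- c(median = 4000, mean = 6247)

# Ratio of top 10 to bottom 10 amounts for each statistic.
ratio <- top / bottom
print(ratio)
```

The ratio of the medians is 8500 / 4000 = 2.125, while the ratio of the means is smaller, at roughly 1.6.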

Analysis by State

Most of the states fall along the same linear relationship between Grade and Borrower Rate. However, there are some exceptions: Iowa and Maine seem to get exceptionally low rates for the given credit grade.
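
One way to make such exceptions explicit is to average the rate and a numeric grade per state, fit a straight line through the state means, and look for states with large negative residuals. The sketch below does this in base R. All values are invented, and GradeNumeric is a hypothetical numeric encoding of the credit grade in which higher values mean worse grades.

```r
# Toy data: two invented loans per state; GradeNumeric is a hypothetical
# numeric encoding of the credit grade (higher = worse).
loans <- data.frame(
  BorrowerState = rep(c("CA", "TX", "NY", "FL", "IA"), each = 2),
  GradeNumeric  = c(2, 4,  3, 5,  4, 6,  1, 3,  3, 5),
  BorrowerRate  = c(0.13, 0.17,  0.18, 0.22,  0.23, 0.27,
                    0.08, 0.12,  0.10, 0.14),
  stringsAsFactors = FALSE
)

# Mean borrower rate and mean numeric grade per state.
by_state <- aggregate(cbind(BorrowerRate, GradeNumeric) ~ BorrowerState,
                      data = loans, FUN = mean)

# Straight-line fit through the state means; residuals measure how far a
# state sits above (+) or below (-) the common trend.
fit <- lm(BorrowerRate ~ GradeNumeric, data = by_state)
by_state$Residual <- residuals(fit)

# States ordered from furthest below the line to furthest above it.
print(by_state[order(by_state$Residual), ])
```

In this toy data the state labelled IA sits well below the line by construction, which mirrors the pattern described for Iowa and Maine.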

Why are Iowa and Maine getting cheaper rates?

Iowa

Maine

Since Not Available is the most common listing category for both states, it is difficult to identify any specific listing category that is getting better borrower rates. Despite that, the borrower rates for each listing category seem to correlate with the credit grade obtained.

In Maine, it seems that Business loans get worse borrower rates than other listing categories.

Using GGally to explore more variables

Listing categories do not provide much insight into the differences in borrower rates among states. To explain the differences between Maine, Iowa and the rest of the states, more variables have been analysed using GGally. The variables analysed are:

1. BorrowerRate
2. LoanOriginationYear
3. IsBorrowerHomeowner
4. IncomeRange
5. EmploymentStatus
6. OpenRevolvingAccounts
7. Recommendations
8. Investors
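
A comparison of this kind could be set up as in the sketch below, which splits the data into one state and all other states and builds a pairs plot for each with GGally::ggpairs(). The rows are invented, only three of the listed variables are included to keep the example short, and the plotting step runs only if the GGally package is installed.

```r
# Toy data: all values are invented for illustration.
loans <- data.frame(
  BorrowerState       = c("ME", "ME", "ME", "CA", "TX", "NY"),
  BorrowerRate        = c(0.09, 0.11, 0.10, 0.19, 0.21, 0.17),
  LoanOriginationYear = c(2008, 2008, 2007, 2013, 2013, 2012),
  IsBorrowerHomeowner = factor(c("True", "True", "False",
                                 "False", "True", "False")),
  stringsAsFactors = FALSE
)

# Subset of the variables listed above, kept short for the example.
vars <- c("BorrowerRate", "LoanOriginationYear", "IsBorrowerHomeowner")

# One data frame for Maine and one for every other state.
maine <- loans[loans$BorrowerState == "ME", vars]
rest  <- loans[loans$BorrowerState != "ME", vars]

# Build the pairs plots only when GGally is available.
if (requireNamespace("GGally", quietly = TRUE)) {
  maine_pairs <- GGally::ggpairs(maine)
  rest_pairs  <- GGally::ggpairs(rest)
}
```

Printing maine_pairs and rest_pairs next to each other allows the distributions of each variable to be compared between the two subsets.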

Maine Variables

All States But Maine Variables

Comparison

As observed in previous sections, both owning a home and the loan origination year correlate with borrower rates.

In the case of Maine, the majority of loans were originated in 2008, whereas for the rest of the data the majority of loans were originated in 2013. As observed in previous sections, the median borrower rate is higher in 2013 than in 2008. That accounts for some of the differences in borrower rate observed between Maine and the rest of the states. There are also more home owners in Maine than in the rest of the states.

Iowa Variables

All States But Iowa Variables

Comparison

The results obtained are similar to those obtained for Maine. The majority of loans were created in 2008 and the majority of borrowers own a house.

Multivariate Analysis

In this section multiple variables have been analysed with the purpose of finding correlations with the borrower rate. Following the same procedure as in the previous section, the data has been grouped into the top and bottom 10 Occupations groups sorted by borrower rate.

Individuals in the top 10 Occupations group earn more, are more likely to own a house and have fewer delinquencies on their record. On the other hand, individuals in the bottom 10 Occupations group earn less, are less likely to own a house and have more delinquencies on their record.

Aside from the classification by Occupation, the variation of the borrower rate by state has also been analysed. The data has been grouped by state and the aggregated mean borrower rate and mean credit grade (numeric) have been calculated.

When borrower rate is plotted against credit grade, most states fall along the same line, with the exception of Iowa and Maine, which have extraordinarily low borrower rates for the given mean credit grade. Therefore, these states have been analysed in more depth to understand the primary reason behind these differences. It seems that, in the data instances representing those states, there is a higher proportion of home owners than in the remaining states. In addition, in these states the majority of loans were created in 2008, a year in which the mean borrower rates were lower than in 2013 (the year in which the majority of loans were created for the rest of the states).


Final Plots and Summary

The data has been grouped by Occupation and then compared by its aggregated mean borrower rate. The occupations that get the best and worst borrower rates are shown in the table below:

Top 10 Occupations         | Bottom 10 Occupations
Judge                      | Teacher’s Aide
Doctor                     | Nurse’s Aide
Pharmacist                 | Student - College Freshman
Computer Programmer        | Administrative Assistant
Engineer - Electrical      | Bus Driver
Scientist                  | Clerical
Attorney                   | Laborer
Engineer - Chemical        | Student - Community College
Pilot - Private/Commercial | Food Service
Military Officer           | Sales - Retail

All the top 10 Occupations are higher education occupations, whereas the bottom 10 are not. To better understand why these individuals get better rates, the following variables have been analysed in relation to the borrower rate:

Income Range Differences Plot

Description

There is an outstanding difference in income ranges between the two groups. The majority of individuals belonging to the top 10 Occupations group are in the $100,000+ range. However, the majority of individuals in the bottom 10 Occupations group earn between $25,000 and $49,999. It can be assumed that individuals in the top 10 group earn at least twice as much per year.

Home Owner Differences Plot

Description

The proportion of individuals from the top 10 Occupations group that own a home is significantly higher. Around 60% of them own a home, whereas less than 40% of individuals from the bottom 10 occupations group do. That is a difference of more than 20 percentage points.

Delinquencies In Past 7 Years Plot

Description

Again there is a significant difference between the two groups. Only 23.5% of individuals from the top 10 occupations group have delinquencies on their record from the past 7 years, compared to 35.6% from the other group.


Reflection

The Prosper loan data set contains information on 113,937 loans across 81 variables. From the start of the analysis I was interested in the borrower rate and which other variables could correlate with it. Out of the 81 available variables, I only analysed in depth those that I considered would correlate with the borrower rate, such as the income range.

In the second section, by observing the occupations that got the best rates, I realized that individuals with higher education occupations seem to get the best rates. I then carried on with the analysis, grouping the instances in the data set by top and bottom 10 occupations. The question was what other reasons, apart from having higher-ranking occupations, could explain the better rates obtained by individuals in this group.

These individuals, with a better borrower rate and higher education jobs, are also more likely to own a house, have a higher income and are less likely to have delinquencies on their record in the past 7 years. Individuals who got the worst borrowing rates have lower education jobs. They are less likely to own a house and are more likely to have delinquencies on their record in the last 7 years. Out of the individuals with delinquencies from both groups, those belonging to the bottom group have on average around 50% more delinquencies than those belonging to the top group.

To my surprise, the group of individuals getting lower borrower rates were getting loans whose median original amount was twice as large as that of the group getting higher borrower rates.

Income range, likelihood of owning a house and delinquencies on the record are surely some of the variables that account for the differences in borrower rates between the two groups. However, to better understand the reasons behind these differences, more variables have to be taken into account in future analysis. In addition, the current analysis has been based on the top and bottom 10 occupations by borrower rate, which account for around 20% of the data available in the data set. Using the remaining 80% of the data to explore the same variables on which this analysis has centered could lead to further insights and conclusions about these variables.